
Added support for log_model='best_and_last' option in wandb logger #9356

Conversation

@ohayonguy (Author) commented Sep 7, 2021

What does this PR do?

As a followup of the discussion #9342 (reply in thread)

Currently, the Weights & Biases logger can log checkpoints as artifacts either at the end of training or during training (any time a new checkpoint is created by the ModelCheckpoint callback). This behavior is controlled by the log_model flag: log_model=True logs at the end of training, and log_model='all' logs every checkpoint during training. However, the latter option can blow up the artifact storage, since many checkpoints may be saved, which is undesirable especially when using wandb local.

Thus, I added a log_model='best_and_last' option, which keeps only the best and the last checkpoints in the wandb artifact storage. Previous artifact versions are automatically deleted.

No extra dependencies are required. Only wandb.
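For readers skimming this PR, here is a minimal usage sketch of the proposed option (the 'best_and_last' value exists only in this PR, and the project/monitor names below are made up):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger

# 'best_and_last' is the value added by this PR: only the checkpoint
# versions aliased 'best' and 'latest' are kept as W&B artifact versions;
# older, unaliased versions are deleted automatically.
wandb_logger = WandbLogger(project="my-project", log_model="best_and_last")
checkpoint_callback = ModelCheckpoint(monitor="val_loss", mode="min")
trainer = Trainer(logger=wandb_logger, callbacks=[checkpoint_callback], max_epochs=10)
# trainer.fit(model)  # model: any LightningModule
```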

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun? yup :)

Make sure you had fun coding 🙃

codecov bot commented Sep 7, 2021

Codecov Report

Merging #9356 (e0b0fa7) into master (392c577) will decrease coverage by 0%.
The diff coverage is 67%.

@@          Coverage Diff           @@
##           master   #9356   +/-   ##
======================================
- Coverage      88%     88%   -0%     
======================================
  Files         178     178           
  Lines       14895   14900    +5     
======================================
+ Hits        13138   13141    +3     
- Misses       1757    1759    +2     

logger.experiment.project_name.return_value = "project"
trainer = Trainer(default_root_dir=tmpdir, logger=logger, max_epochs=2, limit_train_batches=3, limit_val_batches=3)
trainer.fit(model)
assert wandb.init().log_artifact.call_count == 2
Member

this does not test that the other versions are properly removed. Can we also test/mock this somehow?

@ohayonguy (Author) Sep 7, 2021

For sure. I will figure out how this should be properly tested.
Do you think that this should be done with mocking? @justusschock
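Not part of the PR, just a minimal sketch of how the deletion could be asserted with unittest.mock. The delete_unaliased_versions helper below is a hypothetical stand-in that mirrors the cleanup loop added in this PR (shown in the diff further down); a real Lightning test would instead patch wandb and drive Trainer.fit:

```python
from unittest import mock

def delete_unaliased_versions(api, artifact_name):
    # Stand-in for the PR's cleanup: drop every artifact version that
    # carries no alias ('latest'/'best' survive).
    for version in api.artifact_versions(artifact_name, "model"):
        if len(version.aliases) == 0:
            version.delete()

def test_only_unaliased_versions_are_deleted():
    aliased = mock.MagicMock(aliases=["latest", "best"])
    stale = mock.MagicMock(aliases=[])
    api = mock.MagicMock()
    api.artifact_versions.return_value = [aliased, stale]

    delete_unaliased_versions(api, "model-123")

    # Only the version without aliases should have been deleted.
    stale.delete.assert_called_once()
    aliased.delete.assert_not_called()
```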

Contributor

Maybe you should train for 5 epochs in this test

@justusschock (Member)

@borisdayma mind reviewing this PR?

@tchaton added the wandb label Sep 7, 2021
Comment on lines +314 to +319
api = wandb.Api(overrides={"project": self.experiment.project})

for version in api.artifact_versions(f"model-{self.experiment.id}", "model"):
# Clean up all versions that don't have an alias such as 'latest'.
if len(version.aliases) == 0:
version.delete()
Contributor

Noob question: I am not familiar with the wandb API.

It seems models are being versioned, and you are deleting all versions which don't have either the latest or best alias.

I am not sure I grasp why this would save only the best and last model weights.

Furthermore, I don't think this would work with multiple ModelCheckpoint callbacks. Should we save the monitor as metadata to perform the filtering?

best_and_last should produce at most num_model_checkpoints + 1 checkpoints, right?

Contributor

We automatically tag some artifact versions (model checkpoints). We tag the "latest", and we tag the "best" when a monitored value is defined (they can point to the same model). So at most 2 versions are tagged.
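For anyone unfamiliar with the wandb side, a rough sketch (assumed project name and checkpoint path, not the logger's actual code) of how a checkpoint artifact picks up these aliases; each upload creates a new version, and only the versions currently pointed to by 'latest' and 'best' keep an alias:

```python
import wandb

run = wandb.init(project="my-project")

# Each checkpoint upload creates a new version of the same artifact;
# 'latest' always moves to the newest version, and 'best' moves when
# the monitored metric improves, so at most two versions stay aliased.
artifact = wandb.Artifact(name=f"model-{run.id}", type="model")
artifact.add_file("checkpoints/epoch=3-step=120.ckpt")  # placeholder path
run.log_artifact(artifact, aliases=["latest", "best"])
```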

Author

@borisdayma Yeah, exactly. The implementation relies on the fact that there are at most 2 live aliases at the same time.
Several aliases can probably be included (best_0, best_1, best_2, ..., latest), but this is another PR :)

Contributor

I believe best_{monitor_0}, best_{monitor_1}, best_{monitor_2} would be better; it would enable users to navigate their weights more easily in the wandb UI.

Contributor

The idea is that we directly leverage ModelCheckpoint to identify best metrics (easier to maintain the callback, avoid replicating the same logic, and maybe easier for users).
You can see an example here.

@borisdayma (Contributor) left a comment

Looks good to me. I'm just curious about the naming and whether we could offer this functionality in a cleaner way.

An alternative could be to just add an argument such as "remove_unused_artifacts" to the logger, which would be False by default.
Then we can just have a separate function for this purpose. What do you think?

@ohayonguy (Author)

> Looks good to me. I'm just curious about the naming and whether we could offer this functionality in a cleaner way.
>
> An alternative could be to just add an argument such as "remove_unused_artifacts" to the logger, which would be False by default.
> Then we can just have a separate function for this purpose. What do you think?

Great idea @borisdayma to make the naming cleaner. However, I think we need to take this a step further. So we would pass log_model=all and remove_unused_artifacts=True to achieve our functionality? Then what does "all" mean, and what does it refer to? It's a bit confusing to me, and I think we need to come up with a solution that is easier to interpret in terms of what's happening in the background.

Let's examine the current functionality of log_model:

  • Setting this flag to True means to save all the checkpoints ONLY WHEN TRAINING FINISHES. Why is this behavior necessary at all? In what situation would this be preferred over saving checkpoints along the way and deleting old versions, like ModelCheckpoint does? The final result is the same anyway, and the disk space usage is also the same (in the end).
  • Setting this flag to all means to save the entire history of checkpoints.

Therefore, I suggest:

  1. Change the naming of log_model to save_checkpoints_as_artifacts
  2. save_checkpoints_as_artifacts=True would mean to save checkpoints exactly like ModelCheckpoint. This means that we should add aliases like best_0, ... best_k in case save_top_k>1 is passed... (what else should we take into consideration here?)
  3. Add a flag keep_artifacts_history. If this flag is set, then checkpoints are not deleted from the artifacts history.

What do you think?
Thanks!

@borisdayma (Contributor)

So my concern was related to people with limited network bandwidth, whether upload speed or max quota before being throttled.

I have no issue with changing argument names, but that requires backward compatibility, so better to avoid it unless it's clearly beneficial.

logger = WandbLogger(log_model="best_and_last")
logger.experiment.id = "1"
logger.experiment.project_name.return_value = "project"
trainer = Trainer(default_root_dir=tmpdir, logger=logger, max_epochs=2, limit_train_batches=3, limit_val_batches=3)
Contributor

Should this test use a ModelCheckpoint?

Author

Yup. We should definitely build a better test. I am just not familiar with the whole "mocking" thing, or with tests in general. Should my test function also be decorated with mocking?

If not, then this requires some wandb default setup to run the experiment. Not sure how to get this right, as I have never written tests for a project as large as Lightning :)

@borisdayma (Contributor)

Ok so I really like the purpose of this PR.

I think there are a few things to decide:

  • should we be concerned about bandwidth and max upload quota for users? I personally don't have that issue, but the reason we have a mode log_model=True that uploads only at the end is for that exact purpose
  • depending on the answer to the previous question, we can rethink the allowed values for log_model (right now True and all)
  • should the "artifact removal" be on by default? I don't have a strong opinion about it.
  • how to integrate it best? I like keep_artifacts_history (suggested by @ohayonguy) as an arg in the callback and a separate function of the same name that will be called when the flag is on.
  • I suggest keeping the best_1, best_2… functionality (for keeping the top-k models) for a separate PR, as it is more complex and opens many more questions

@ohayonguy (Author)

> Ok so I really like the purpose of this PR.
>
> I think there are a few things to decide:
>
>   • should we be concerned about bandwidth and max upload quota for users? I personally don't have that issue, but the reason we have a mode log_model=True that uploads only at the end is for that exact purpose
>   • depending on the answer to the previous question, we can rethink the allowed values for log_model (right now True and all)
>   • should the "artifact removal" be on by default? I don't have a strong opinion about it.
>   • how to integrate it best? I like keep_artifacts_history (suggested by @ohayonguy) as an arg in the callback and a separate function of the same name that will be called when the flag is on.
>   • I suggest keeping the best_1, best_2… functionality (for keeping the top-k models) for a separate PR, as it is more complex and opens many more questions

Regarding the network bandwidth, I agree that this might be a problem for some people.
However, logging the checkpoints only at the end of training is not that safe. In addition, if someone really has a network bandwidth issue, they can reference artifacts in their local file system instead of uploading them to wandb (the checkpoints are saved by ModelCheckpoint anyway).
Despite all that, I think we should still keep the functionality to upload only at the end of training (why not?).

How about this:

  • log_model could be either True or False. If it is set to True, then you can play with the way artifacts are uploaded with a different flag:
  • artifacts_history could be all or most_recent, and it is self-explanatory. This flag is relevant not only for log_model but for all possible artifacts. This logger might be expanded in the future to collect more kinds of artifacts, and this flag would determine whether history is preserved for all of them.

What do you think, @borisdayma @tchaton? As in all engineering, there will be a tradeoff eventually...
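Purely to illustrate what this proposal would look like to a user (both the artifacts_history argument and this meaning of log_model are hypothetical; neither exists in WandbLogger today):

```python
from pytorch_lightning.loggers import WandbLogger

# Hypothetical API from the proposal above -- not real WandbLogger arguments.
logger = WandbLogger(
    project="my-project",
    log_model=True,                   # upload checkpoints as artifacts at all
    artifacts_history="most_recent",  # or "all" to keep every artifact version
)
```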

@tchaton (Contributor) commented Sep 10, 2021

> Ok so I really like the purpose of this PR.
>
> I think there are a few things to decide:
>
>   • should we be concerned about bandwidth and max upload quota for users? I personally don't have that issue, but the reason we have a mode log_model=True that uploads only at the end is for that exact purpose
>   • depending on the answer to the previous question, we can rethink the allowed values for log_model (right now True and all)
>   • should the "artifact removal" be on by default? I don't have a strong opinion about it.
>   • how to integrate it best? I like keep_artifacts_history (suggested by @ohayonguy) as an arg in the callback and a separate function of the same name that will be called when the flag is on.
>   • I suggest keeping the best_1, best_2… functionality (for keeping the top-k models) for a separate PR, as it is more complex and opens many more questions

Hey @borisdayma, quick question about wandb internals. Are the models uploaded within a separate thread/process? What would be the impact on training of logging all models?

Best,
T.C

@borisdayma (Contributor)

> Hey @borisdayma, quick question about wandb internals. Are the models uploaded within a separate thread/process? What would be the impact on training of logging all models?

It is performed in a separate process, so it should not impact training at all (even if the wandb process were to fail).

> How about this:
>
>   • log_model could be either True or False. If it is set to True, then you can play with the way artifacts are uploaded with a different flag:
>   • artifacts_history could be all or most_recent, and it is self-explanatory. This flag is relevant not only for log_model but for all possible artifacts. This logger might be expanded in the future to collect more kinds of artifacts, and this flag would determine whether history is preserved for all of them.

That's a great idea! In addition, I suggest:

  • log_model=True logs during training by default (more convenient), so it's the same as the current log_model='all'
  • log_model='all' still works but will be deprecated in the future (warning for now; see the sketch after this list)
  • logging only at the end needs a new flag or a new argument
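A minimal sketch of the deprecation path suggested in the second bullet above; the _normalize_log_model helper is hypothetical and not the PR's code:

```python
import warnings

def _normalize_log_model(log_model):
    # Hypothetical normalization: 'all' keeps working for now but warns,
    # since True would already mean "log checkpoints during training".
    if log_model == "all":
        warnings.warn(
            "log_model='all' is deprecated; use log_model=True, which now "
            "logs checkpoints during training.",
            DeprecationWarning,
        )
        return True
    return bool(log_model)
```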

@tchaton (Contributor) commented Sep 14, 2021

Hey @borisdayma, @ohayonguy,

Yes, I like the proposal too. Should we adapt this PR directly with the suggested changes?

Best,
T.C

@borisdayma (Contributor)

As you guys want! I can work on it later this week unless @ohayonguy wants to update this PR?

@ohayonguy (Author)

> As you guys want! I can work on it later this week unless @ohayonguy wants to update this PR?

Hey guys @borisdayma @tchaton
I can definitely work on it (and would be happy to). However, it could take a while, as I am currently super busy with my internship.
Sorry for the late response.

@Borda added the 3rd party (Related to a 3rd-party) label Sep 23, 2021
@Borda removed the wandb label Sep 23, 2021
@awaelchli added the feature (Is an improvement or enhancement) and logger (Related to the Loggers) labels Sep 27, 2021
stale bot commented Oct 12, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

stale bot added the won't fix (This will not be worked on) label Oct 12, 2021

stale bot commented Oct 17, 2021

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

stale bot closed this Oct 17, 2021
@sh3rlock14

Hi everyone 👋🏻!
I see that this PR is now closed and the proposed feature didn't make it into newer versions of PL: did you choose to keep the behaviour of WandbLogger as it is, or is something new being developed?

Thanks for your help!

@Buckler89

Any updates on this? Will this feature be merged?
I think this is important since logging the model during training helps protect against crashes and allows the experiment to be resumed from its last saved checkpoint.

@taczin commented Feb 21, 2025

+1 for this feature. Anyone wanting to train a large model will be interested in this. Due to the large model size, keeping a full history of checkpoints takes up too much disk space, but waiting until the end of training to upload the top-k checkpoints is too vulnerable to crashes.

Labels: 3rd party (Related to a 3rd-party), feature (Is an improvement or enhancement), has conflicts, logger (Related to the Loggers), won't fix (This will not be worked on)